Machine Learning Approaches to Understanding DNA Methylation in Cancers

Authors: Sam Coleman and Jacob Tye

Overview

This project explores the application of machine learning techniques to DNA methylation data from The Cancer Genome Atlas (TCGA) and cBioPortal.
The primary objectives are to distinguish tumor samples from normal samples and to classify breast cancer samples by subtype.

Data Sources

The analyses use publicly available data from TCGA and cBioPortal.

Data preparation scripts are located in the analysis/data_prep directory.

Note: The data is not included in the GitHub repository because of GitHub file size limits. To obtain it, see the scripts in the analysis/data_prep directory and the list of samples used in data/methylation/all_samples_450K.tsv.

Methodology

Tumor Classification

1.1 Locally Linear Embedding with Logistic Regression

1.2 Principal Component Analysis with Gradient Boosting

Both models achieved over 97% accuracy on the testing dataset.
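
The two pipelines above can be sketched with scikit-learn as follows. This is only a minimal illustration, not the project's actual code: the data is a synthetic stand-in generated with make_classification, and the number of components, the neighbor count, and the train/test split are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a samples-by-probes methylation matrix
# (assumption: label 1 = tumor, label 0 = normal).
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each pipeline reduces dimensionality first, then fits a classifier.
# All hyperparameters here are illustrative assumptions.
pipelines = {
    "LLE + logistic regression": make_pipeline(
        LocallyLinearEmbedding(
            n_neighbors=15, n_components=10, eigen_solver="dense", random_state=0
        ),
        LogisticRegression(max_iter=1000),
    ),
    "PCA + gradient boosting": make_pipeline(
        PCA(n_components=10, random_state=0),
        GradientBoostingClassifier(random_state=0),
    ),
}

accuracies = {}
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    accuracies[name] = pipeline.score(X_test, y_test)  # held-out accuracy
    print(f"{name}: test accuracy = {accuracies[name]:.3f}")
```

Wrapping the reduction step and the classifier in one pipeline means the embedding is fitted on the training split only, which avoids leaking test samples into the dimensionality reduction.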

Breast Cancer Subtype Classification

2.1 Unsupervised:
K-Means clustering, resulting in a normalized mutual information (NMI) score of 0.2902.
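
As a rough sketch of how such an NMI score can be computed, the example below clusters synthetic data with K-Means and compares the cluster assignments against known labels. The data, the feature count, and N_SUBTYPES are hypothetical placeholders, not values taken from the project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

N_SUBTYPES = 5  # assumption: the number of subtype labels in the real data may differ

# Synthetic stand-in for breast cancer methylation profiles with overlapping groups.
X, subtype_labels = make_blobs(
    n_samples=300, n_features=50, centers=N_SUBTYPES, cluster_std=8.0, random_state=0
)

# Cluster without using the labels, then score agreement with them afterwards.
kmeans = KMeans(n_clusters=N_SUBTYPES, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

nmi = normalized_mutual_info_score(subtype_labels, cluster_ids)
print(f"NMI between clusters and subtype labels: {nmi:.4f}")
```

NMI ranges from 0 (no shared information) to 1 (perfect agreement) and does not depend on how the clusters are numbered, which makes it suitable for comparing unlabeled clusters with annotated subtypes.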

2.2 Supervised:
Neural network following feature dimensionality reduction, achieving an NMI score of 0.58.
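
A supervised counterpart might look like the sketch below. The README does not state which reduction method or network architecture was used, so PCA and a small scikit-learn MLPClassifier are swapped in here purely as assumptions, and the data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for subtype-labelled methylation data (4 hypothetical subtypes).
X, y = make_classification(
    n_samples=400, n_features=200, n_informative=20,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Reduce dimensionality, rescale the components, then train a small network.
# PCA and the layer size are assumptions, not the project's confirmed choices.
subtype_model = make_pipeline(
    PCA(n_components=20, random_state=0),
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
subtype_model.fit(X_train, y_train)

predicted = subtype_model.predict(X_test)
supervised_nmi = normalized_mutual_info_score(y_test, predicted)
print(f"NMI on held-out samples: {supervised_nmi:.4f}")
```

Scoring the predictions with NMI rather than accuracy keeps the supervised result on the same scale as the K-Means result.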

Results

The models effectively differentiate tumors from normal cells, indicating the potential of integrating methylation data with machine learning for early detection and diagnosis.
The lower performance in subtype classification suggests epigenetic heterogeneity within current categorization systems, highlighting opportunities for refinement.

Report

The project presentation is in final_project_presentation.pptx.

The full report is in ML_Project_Report.pdf.

Dataset References

  1. Cerami et al. "The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data." Cancer Discovery, May 2012.

  2. Gao et al. "Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal." Sci. Signal., 2013.

  3. de Bruijn et al. "Analysis and Visualization of Longitudinal Genomic and Clinical Data from the AACR Project GENIE Biopharma Collaborative in cBioPortal." Cancer Res, 2023.

The results presented are based on data generated by the TCGA Research Network: https://www.cancer.gov/tcga.